Spam Filtering Using Character-Level Markov Models: Experiments for the TREC 2005 Spam Track

نویسندگان

  • Andrej Bratko
  • Bogdan Filipic
چکیده

This paper summarizes our participation in the TREC 2005 spam track, in which we consider the use of adaptive statistical data compression models for the spam filtering task. The nature of these models allows them to be employed as Bayesian text classifiers based on character sequences. We experimented with two different compression algorithms under varying model parameters. All four filters that we submitted exhibited strong performance in the official evaluation, indicating that data compression models are well suited to the spam filtering problem.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Spam Filtering Using Compression Models

Spam filtering poses a special problem in text categorization, of which the defining characteristic is that filters face an active adversary, which constantly attempts to evade filtering. Since spam evolves continuously and most practical applications are based on online user feedback, the task calls for fast, incremental and robust learning algorithms. This paper summarizes our experiments for...

متن کامل

Batch and Online Spam Filter Comparison

In the TREC 2005 Spam Evaluation Track, a number of popular spam filters – all owing their heritage to Graham’s A Plan for Spam – did quite well. Machine learning techniques reported elsewhere to perform well were hardly represented in the participating filters, and not represented at all in the better results. A non-traditional technique Prediction by Partial Matching (PPM) – performed excepti...

متن کامل

DalTREC 2005 Spam Track: Spam Filtering Using N-gram-based Techniques

We briefly describe DalTREC 2005 Spam submission. DalTREC is the TREC research project at Dalhousie University. Four packages were submitted and they resulted in a median performance. The results are interesting and may be seen positive in the light of simplicity of our approaches.

متن کامل

BUPT at TREC 2006: Spam Track

This report summarizes our participation in the TREC 2006 spam track, in which we consider the use of Bayesian models for the spam filtering task. Firstly, our anti-spam filter, Kidult, is briefly introduced. And then we try to use weighted adjustment of separating hyperplane and selective classifiers ensemble to improve the filtering performance. Finally, we summarize the relevant results from...

متن کامل

Document and Query Expansion Models for Blog Distillation

This paper presents the CMU submission to the 2008 TREC blog distillation track. Similar to last year’s experiments, we evaluate different retrieval models and apply a query expansion method that leverages the link structure in Wikipedia. We also explore using a corpus that combines several different representations of the documents, using both the feed XML and permalink HTML, and apply initial...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2005